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Abstract 

Background: The aim of connectivity mapping is to match drugs using drug-treatment gene expression profiles 
from multiple cell lines. This can be viewed as an information retrieval task, with the goal of finding the most relevant 
profiles for a given query drug. We infer the relevance for retrieval by data-driven probabilistic modeling of the drug 
responses, resulting in probabilistic connectivity mapping, and further consider the available cell lines as different data 
sources. We use a special type of probabilistic model to separate what is shared and specific between the sources, in 
contrast to earlier connectivity mapping methods that have intentionally aggregated all available data, neglecting 
information about the differences between the cell lines. 

Results: We show that the probabilistic multi-source connectivity mapping method is superior to alternatives in 
finding functionally and chemically similar drugs from the Connectivity Map data set. We also demonstrate that an 
extension of the method is capable of retrieving combinations of drugs that match different relevant parts of the 
query drug response profile. 

Conclusions: The probabilistic modeling-based connectivity mapping method provides a promising alternative to 
earlier methods. Principled integration of data from different cell lines helps to identify relevant responses for specific 
drug repositioning applications. 

Keywords: Connectivity mapping, Data integration, Gene expression, Latent variable models, Probabilistic modeling 



Background 

Current widespread application of high-throughput tran- 
scriptional profiling has made large collections of drug- 
treatment gene expression data both possible and feasible. 
One of the most important such databases is the Con- 
nectivity Map (CMap) [1] that allows users to match 
transcriptional profiles elicited by drug treatments and 
diseases. The idea is that any perturbation to the genome- 
wise gene expression can be summarized by a proper 
gene signature. Such signatures can be obtained using 
microarray data and used as proxies of disease phe- 
notypes and drug effects. Matching drugs and diseases 
based on these signatures is known as connectivity map- 
ping, and it has shown promise in drug discovery and 
repositioning [2-5]. CMaps successor, the Library of Inte- 
grated Network-based Cellular Signatures (LINCS, http:// 
www.lincsproject.org/), will offer data for thousands of 
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compounds on tens of cell lines in the near future, pro- 
viding a unique resource for connectivity mapping -based 
drug discovery. 

Connectivity mapping can be seen as an information 
retrieval problem, where the task is to find the most rele- 
vant gene expression profile for a given query drug profile. 
The key to successful retrieval is a good definition of the 
relevance measure. Current connectivity mapping meth- 
ods define relevance based on similarity in the sets of top 
up- and down-regulated genes between the two measure- 
ment profiles [1] or the consensus profiles constructed by 
combining all measurement samples for a given drug [2]. 
Using non-parametric rank-based statistics to define the 
similarity [6], these methods can integrate data from mul- 
tiple measurement platforms while reducing batch effects. 
Alternatively, one could use the Pearson correlation to 
compute the similarity, but it is more sensitive to platform 
differences [1]. 

Transcriptional drug-treatment databases, such as 
CMap and LINCS, provide measurement data for vari- 
ous experimental factors, including multiple cell types, 
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doses, and time points. So far, data over multiple experi- 
mental factors has been aggregated into a consensus view 
[2], but this method intentionally ignores possible cell- 
line-specific effects of the drugs [4] . With the number of 
experimental factors growing notably in the future, data 
integration methods capable of distinguishing cell-line- 
specific effects and various types of consensus or common 
effects would be needed to bring out the full benefits from 
connectivity mapping. 

In this paper, we propose an alternative, probabilistic 
model-based approach for defining relevance, with the 
assumption that a suitably chosen probabilistic model can 
detect relevant effects from the noisy data. If the repre- 
sentation that the model provides is more informative and 
less noisy than the input data, retrieval is then more pre- 
cise based on the model instead of based on the noisy 
original data. For tractability, we assume that the tran- 
scriptional effects caused by drug treatments consists of 
a set of processes that generate partly overlapping pat- 
terns in the observations, and model each process as a 
probabilistic latent factor of data. 

Assume then that some of the factors are shared by sub- 
sets of the cell lines, and some are specific to individual 
cell lines. When searching for drugs for a specific type of 
cancer, for instance, effects in those cell lines are then rele- 
vant, and it would be natural to define relevance as activity 
in those factors. 

Relevance stems from the goal of the analyst, and 
can alternatively be to find effects specific to one cell 
line. If there are several relevant cell lines, however, a 
nice side benefit follows: The data contains noise from 
various sources in addition to the signal, such as mea- 
surement batch effects, and the noise is, by definition, 
specific to individual cell lines. If relevance is defined 
in terms of the shared activity, it is more tolerant to 
noise. 

What remains now is to find a method to integrate data 
sources to identify shared patterns. A classical method 
is the Canonical Correlation Analysis (CCA, [7]), which 
seeks statistical dependencies between two data sets with 
paired samples. CCA has been applied for multiple bio- 
logical problems [8-10]. However, for the general connec- 
tivity mapping problem, CCA is not sufficient as it only 
searches for the shared factors and needs to be generalized 
to multiple data sources. 

A recent data integration method, called Group Factor 
Analysis (GFA, [11]), is a generalization of CCA directly 
suitable for the task. GFA decomposes the transcriptional 
response data into factors specific to individual cell lines 
and factors shared by two or more cell lines. The name 
comes from the analysis of groups of variables, here one 
group for one cell line. Besides being a generalization 
of CCA, the method generalizes standard factor analy- 
sis from finding relationships between scalar variables to 



finding relationships between groups of variables, or data 
sources. 

Data integration with GFA is one key novel aspect in 
our method, as the earlier connectivity mapping meth- 
ods intentionally did not study which responses generalize 
across the cell lines and which do not. The consensus- 
based method [2] assumes that only the general effects 
of drugs are relevant, effectively discarding any specific 
effects as noise. This is optimal only in the case of drugs 
with similar effects across cell lines, but this is not always 
true and hence the consensus-based method is overly 
restrictive. GFA scales to an arbitrary number of data 
sources, and the Bayesian probabilistic modeling makes it 
possible to cope with the biggest problem of gene expres- 
sion data, the "large p small n problem of having a 
relatively large number of variables (genes) compared to 
the number of samples. 

Given the probabilistic model, retrieval of the relevant 
drug response profiles is then performed based on an 
activity profile over the factors, or alternatively the latent 
factor representation, the model has learned from data. 
We call the approach probabilistic connectivity mapping 
(Figure 1). A suitable relevance measure is the Pearson 
correlation, as it focuses on the active (non-zero) factors 
of the query and ignores the inactive ones. Depending on 
the goal, the analyst can choose to focus on factors shared 
by cell lines, specific factors, or both. 

We apply the method to the CMap data and show that it 
outperforms earlier connectivity mapping approaches in 
finding functionally and chemically similar drugs. Addi- 
tionally, the careful data integration helps: Shared factors 
are the most relevant for the retrieval, but some specific 
factors are relevant as well. This indicates that while most 
drugs exhibit similar responses across cell lines, there are 
also some important differences that are captured by our 
model. 

Alternatively to GFA, a more straightforward proba- 
bilistic factor analysis can also be used by simply concate- 
nating data from all cell lines and not taking into account 
the grouping of the variables according to cell lines. We 
will consider this alternative as well; GFA is expected 
to have the advantage that interpretation of the factors 
should be easier as they explicitly specialize to a subset of 
cell lines, but the retrieval performances are expected to 
be similar. 

In addition to retrieval of single drugs, we demon- 
strate how the model-based approach can be extended to 
retrieve combinations of drugs. The idea is to retrieve a 
set of drugs, where each drug matches a different part 
of the relevant query response. This is beneficial for 
polypharmacology, where drugs have multiple target effects 
[12,13]. We demonstrate that combinatorial retrieval 
can provide complementary information to single-drug 
retrieval for polypharmacologic drugs. 
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Figure 1 Overview to probabilistic connectivity mapping. The input data for probabilistic connectivity mapping are a collection of 
drug-treatment gene expression profiles, measured on multiple cell lines. Probabilistic modeling, here Group Factor Analysis, is applied to explain 
the data in terms of a set of factors Zand their loadings W. The factors can be active in one or more cell lines, capturing both specific and shared 
drug response effects. For the probabilistic connectivity mapping, a relevance measure between two drugs is finally defined as a similarity of their 
factor activities z /; computed in practice as the Pearson correlation. 



Data integration via probabilistic modeling is expected 
to bring a couple of further benefits. As the strengths 
of the responses vary widely, and the data is expected 
to be heteroskedastic, fixed signature sizes used in cur- 
rent connectivity mapping approaches may lose impor- 
tant information. The probabilistic modeling approach 
copes with varying sample norm in a natural fashion. A 
final benefit is the ability to cope with batch effects that 
plague microarray experiments. They are view-specific by 
nature, so retrieval that focuses on the shared effects can 
help to further reduce the batch effects, complementing 
preprocessing procedures such as mean-centering [14]. 



Results and discussion 

Connectivity mapping results 

We evaluated the proposed probabilistic connectivity 
mapping approach by applying it to a collection of 718 
compounds and three cell lines from the CMap database, 
normalized with mean-centering [14]. The gene expres- 
sion profiles were modeled across the set of 930 Landmark 
genes identified in the LINCS project. Three probabilis- 
tic models were used: Group Factor Analysis (GFA), 
sparse factor analysis (sFA), and Bayesian principal com- 
ponent analysis (BPCA). As comparison, we used two ear- 
lier connectivity mapping methods: rank-based average 
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enrichment-score distance (AESD, [2]) and correlation 
(COR) on the differential expression data averaged over 
the cell lines. We evaluated the retrieval performance 
based on two external "ground truths" on relevance: how 
many of the retrieved samples have the same fourth level 
ATC codes as the query drug and chemical similarity. We 
measured retrieval performance with two complementary 
goodness measures: partial area under the ROC curve and 
top- 10 mean average precision (MAP). 

Probabilistic connectivity mapping with GFA and sFA 
clearly outperform the other methods (Figure 2) on both 
ground truths and goodness measures. The sFA was 
slightly better with the partial AUC measure and GFA 
for the top- 10 MAP measure. Bayesian PCA clearly per- 
formed worse, indicating that the sparsity assumptions 
made in GFA and sFA are important for capturing the 
relevant responses from the data. 

In the experiments of Figure 2, we used all factors, as 
that turned out to produce the best absolute retrieval 
performance for this data. We next investigated the pos- 
sible benefits of focusing on the factors shared by the 
cell lines. The retrieval was based on the most active 
shared factors (from GFA), and compared to the perfor- 
mance with an equal number of the most active factors 
that are specific to one cell line. Additionally, we com- 
pared this to the most active factors from sFA. Figure 3 
shows that the shared factors produce better retrieval 
almost everywhere. These results suggest that the explicit 



group-wise sparsity assumption in GFA, resulting in the 
decomposition to shared and specific effects, is beneficial 
in modeling data from multiple cell lines. 

Combinatorial retrieval results 

We next studied how well the method extends to com- 
binatorial retrieval, that is, retrieval of multiple drugs 
that together are relevant to the query. We queried with 
drugs having multiple ATC codes, and the ground truth 
(unknown to the model) was the set of ATC codes. Our 
hypothesis was that if some of the ATC codes represent 
minor response effects, drugs with those codes would 
not get a high relevance score when retrieving single 
drugs, as the drugs with the other code(s) would dom- 
inate. However, the minority codes could show up in 
combinatorial retrieval. We also expect the combinato- 
rial retrieval to work better when the multiple effects of 
the query are more varied, as the effects would then get 
less mixed up. Figure 4 shows an example of combina- 
torial retrieval results and compares them to single-drug 
retrieval results. Comparisons of the retrieval perfor- 
mance are summarized in Figure 5. We see that combina- 
torial retrieval improves the results for a good proportion 
of the polypharmacologic drugs, and that performance is 
better with lower ATC levels, that is, more distinguished 
effects. 

As single-drug retrieval is expected to work, even for 
polypharmacologic drugs, when searching for drugs with 
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Connectivity mapping: • probabilistic earlier 

Figure 2 Probabilistic connectivity mapping outperforms earlier alternatives in retrieving similar drugs. The retrieval performance is 
indicated with two goodness measures (top row: partial AUC, bottom row: top-1 0 MAP) and two ground truths (left: ATC codes, right: Tanimoto 
similarity of the 2D fingerprints of the drugs). Probabilistic connectivity mapping (purple color) is performed with three models: Group Factor 
Analysis (GFA), sparse factor analysis (sFA), and Bayesian PCA (BPCA). These are compared to two earlier connectivity mapping methods (orange 
color): rank-based average enrichment-score distance (AESD) and the Pearson correlation over the differential expression profiles (COR). 
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Figure 3 Factors shared across multiple cell lines are more informative for retrieval performance than cell-line-specific factors. Retrieval 
performance is shown for the top shared (solid line) and specific (dotted line) factors from GFA (green color) and sFA (brown color), as a function of 
the number of top factors. Factors were selected based on the highest a parameter values. 



precisely the same combination of effects, we removed 
drugs having multiple ATC matches with the query drug 
from the retrieved set. After that, performance com- 
pared to single-drug retrieval clearly improved (Figure 5), 
indicating that combinatorial retrieval was able to find 
additional drug combinations and provide complemen- 
tary information to single-drug retrieval. 



Conclusions 

We introduced probabilistic connectivity mapping, a 
model-based alternative to earlier drug connectivity map- 
ping methods. Our first contribution was to define the 
relevance for the information retrieval task based on 
a probabilistic model that captures the relevant gene 
expression effects for the query drug in the form of 
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Figure 4 Combinatorial retrieval example. Using scopolamine as the query drug, the top-1 0 retrieval results are shown for single-drug and 
combinatorial retrieval, with ATC codes shared with the query indicated by colors. For combinatorial retrieval, the drugs are ranked (CombRank) 
based on their first appearance in the retrieved pairs (either CombDrugl orCombDrug2). In the example, using both single-drug and combinatorial 
retrieval, a match for ATC code N05 is found at the first rank. However, combinatorial retrieval also provides a match for the other ATC code S01 
already at the first rank, whereas single-drug retrieval finds a match only at rank 9. The result demonstrates that the combinatorial retrieval approach 
can be beneficial for polypharmacologic queries. 
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Figure 5 Combinatorial retrieval provides additional information to complement single-drug retrieval. The y-axis indicates the proportion 
of the query drugs for which combinatorial retrieval improves the rank for the first hit for at least one ATC code (random performance: 0.5). The 
results are shown for four different ATC code levels (x-axis). Red: retrieval from the full set; blue: retrieval after removing drugs having multiple ATC 
matches with the query. 



probabilistic latent factors inferred from data. The chosen 
model integrates data over available experimental factors, 
here cell lines, which has not been considered in earlier 
connectivity mapping approaches. We showed that proba- 
bilistic connectivity mapping outperforms earlier alterna- 
tives in finding functionally and chemically similar drugs, 
based on transcriptional response profiles. We addition- 
ally showed that gene expression response factors shared 
across cell lines, identified by a multi-source probabilistic 
model, were the most relevant for retrieval. We also con- 
firmed the utility of the Landmark genes identified in the 
LINCS project. 

In addition to single-drug retrieval, we showed how 
probabilistic connectivity mapping naturally allows 
retrieval of sets of drugs, and showed how such combina- 
torial retrieval provides complementary information to 
single-drug retrieval for drugs with multiple mechanisms 
of action. 

Connectivity mapping has also been proposed for pre- 
dicting synergistic drug combinations given a disease 
query [3]. A straightforward assumption is that drugs with 
similar gene expression signatures could be synergistic, 
and a successful in vivo proof-of-concept of this approach 
has been reported by Hassane et al. [15]. An alternative 
assumption is to search for drug combinations with either 
completely independent actions or actions on different 
but related targets or pathways [16,17], and our proposed 
combinatorial retrieval method could provide hypotheses 
for such combinations. 



Based on the drug similarity validation with the 
CMap data, probabilistic connectivity mapping provides 
a promising alternative for earlier methods. Next, the 
method could be applied to matching known drugs and 
drug combinations to disease samples, providing hypothe- 
ses of novel therapies. 

For the current CMap data, the absolute retrieval per- 
formance was at its best when all factors were used for 
defining the relevance, even though for smaller numbers 
of factors the shared ones were more informative. We 
expect this to change when the datasets become larger and 
more heterogeneous, requiring more expertise from the 
user to choose a set of informative cell lines, or even more 
advanced tools to model the users' interests. 

As the LINCS -project will generate data over tens of 
cell lines, we also expect other benefits of the Group Fac- 
tor Analysis -based probabilistic connectivity mapping to 
become even more apparent. Being able to identify both 
shared responses across a large number of cell types, and 
on the other hand responses specific only to few cell lines, 
will be highly valuable to drug development and discov- 
ery. It would be even possible to impose more structure 
on the Group Factor Analysis model, inferring which cell 
lines response similarly to the drugs, providing potentially 
highly relevant information for personalized medicine 
approaches. 

The recent work by Iskar et al [5] used a biclustering 
approach to identify important response modules from 
the CMap data, and identify shared modules based on 
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overlapping genes as a post-processing step. They pro- 
posed using the modules to match drugs, even though 
they did not proceed to recommending particular metrics. 
They did, however, demonstrate drug repositioning by 
validating some examples from both shared and cell-line- 
specific modules, suggesting that a suitable probabilistic 
biclustering method (such as [18]) could be usable for 
probabilistic connectivity mapping as well. 

Methods 

Data 

We used the Connectivity Map (CMap) build 2 drug- 
treatment transcriptional data [1]. The data was RMA- 
normalized [19], and we included measurements only 
from the HT-HG_U133A microarray platform, for drugs 
that were measured on all three of the most prominent cell 
lines (MCF7, PC3, HL60). To follow the state-of-the-art 
preprocessing procedure by Iskar et al. [14] we included 
treatments only from the large CMap batches with around 
40 measurements, ignoring the small batches with at most 
6 measurements. 

For each drug and cell line pair, we included only the 
highest concentration. Differential expression was com- 
puted against the mean of the treatment measurements 
for each batch, instead of the biological controls, as sug- 
gested by Iskar et al. [14]. Remaining replicates of drug 
and cell line pairs were merged by averaging. This resulted 
in drug- treatment gene expression profiles for 718 drugs 
for the three cell lines. We additionally re-computed the 
preprocessing by including treatments from all batches. 
This resulted in the addition of only 1.5 % more treatments 
and no new chemicals, and hence the results for all meth- 
ods, and conclusions, were expectedly practically identical 
to those using only the large batches. 

Instead of the full genome, we used the set of Landmark 
genes provided by the LINCS project (http://lincscloud. 
org/ the- landmark- genes/). This set of about 1000 genes 
has been curated based on large gene expression com- 
pendium to be minimally redundant, widely expressed 
in various cellular contexts, and largely representative of 
the full genome. Using this particular set of genes is thus 
expected to result in a higher signal-to-noise ratio in the 
data, as compared to the full genome. The retrieval perfor- 
mance using the Landmark genes was indeed better for all 
methods as compared to using the full genome (results not 
shown), confirming that using them is a sensible choice 
for connectivity mapping. Of the 968 Landmark genes 
provided by LINCS, 930 were present in the CMap data. 

Rank-based connectivity mapping 

Existing connectivity mapping methods use a gene set 
enrichment-based [6] measure for matching drugs [1]. In 
this paper, we use the method described by Iorio et al. 
[2]: The genes were first ranked based on differential 



expression. For each drug, the ranked gene lists from 
the different cell lines were then merged by the Kru- 
Bor rank aggregation method. A consensus gene signature 
was then produced by taking the top up- and down- 
regulated genes from the merged list. The query drug 
was matched to other drugs in the database by comput- 
ing the Kolmogorov-Smirnov statistics based enrichment 
score between the query signature and the ranked lists 
of the other drugs. We tried both average and maximum 
enrichment-score distances (AESD, MESD), AESD giving 
better retrieval performance. Using the full genome, Iorio 
et al. [2] identified 250 genes as an optimal signature size. 
However, as we are using only the 930 Landmark genes, 
we re-validated the signature size, resulting in the best 
retrieval performance with a signature size of 50 genes 
(results not shown). 

Probabilistic connectivity mapping with Group Factor 
Analysis 

Factor analysis (FA) is a standard data analysis tool for 
capturing and understanding linear relationships between 
variables [20] . It uses a set of K factors to explain depen- 
dencies between the features in a data matrix X e R NxD : 

X = ZW r + E , (1) 

where the columns of Z are the K unobserved factors, 
W g R Dx/<r contains their loadings, and E is Gaussian 
residual noise. Different factor analysis variants can be 
defined by choosing specific priors for the loadings W and 
structure for the residual noise E. 

Group Factor Analysis (GFA) was recently introduced 
[11] for generalizing from modeling of dependencies 
between scalar variables, which FA does, to modeling 
dependencies between data sets. In the machine learning 
community, learning from multiple sources of data has 
been called multi-view learning, views referring to data 
sets with shared (or co-occurring) samples. Given a col- 
lection Xi, . . . , Xm of M views, here cell lines, with shared 
samples and dimensionalities Di, . . .,Dm> the task is to 
find K factors that describe the collection and in particu- 
lar the dependencies between the data views X m . For sim- 
plicity, we assume normally distributed data. This choice 
can of course be tailored if there's more prior knowledge. 
In this paper, the assumption is validated based on exter- 
nal retrieval validation. The likelihood for observed data 
Xis 

M 

p(X\W, Z, r) = Yi A/"(X m |ZW^, r- 1 !) . (2) 

m—l 

Now the noise E in Equation 1 is diagonal [t^ 1 , . . . , t^ 1 ] 
with each t" 1 repeated D m times. Hence, every dimen- 
sion within view m has the same noise variance, whereas 
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the views may have different variances. A Gamma prior is 
used for the inverse variances r m : 

M 

p(T\a T ,b T )= Y\g(r m \a T ,b T ). (3) 

m—l 

The factors Z are assumed to be normally distributed 
with zero mean and unit covariance: 

/i(z)~^(0,I). (4) 

The weight matrix W is made group-sparse by a group- 
wise automatic relevance determination (ARD) prior, 

M K 

p{*\a\b") = n n^^i""'*") (5) 

m=l k=l 

K M D m 

p(W) = p(W\a) = [I 11 T\^(™m,k(d)\0,a-\) , 

k=l m=l d=l 

(6) 

where w m ^(<i) denotes the dth element in the projection 
vector w m ,fc- The inverse variance of each vector is con- 
trolled by the parameter a m> k with a Gamma prior. The 
hyperparameters a x ' , b T , a a and b a are set to very small 
values, here 10 -14 . 

The ARD makes groups of variables inactive for spe- 
cific factors by driving their a~\ to zero, providing factors 
that are active for only a specific subset of the views. The 
ability of GFA to separate shared and specific effects is 
the core of the model, distinguishing it from earlier fac- 
tor analysis models. The ARD prior is simultaneously used 
to control the model complexity, that is, the number of 
factors, by shutting down unused factors during the infer- 
ence. There are other alternatives for the ARD prior that 
could be explored in the future. Model inference is carried 
out with a variational approximation, using the R package 
CCAGFA available in CRAN [11]. Details of the inference 
are given in the Appendix. 

To evaluate the benefits from the multi-view Group Fac- 
tor Analysis for probabilistic connectivity mapping, we 
compare it to two alternative formulations of the factor 
analysis problem that do not use the multi-view infor- 
mation. For this, we concatenate all data into a single 
data matrix X. First, we assume that the noise variance 
is equal over the variables, reducing the factor analy- 
sis to the Bayesian principal component analysis (BPCA) 
[21]. Second, we assign each feature an independent 
ARD-prior, resulting in a sparse factor analysis model 
(sFA, [22]). 

Given the set of factors Z, identified by the model 
applied on a collection of drug-treatment measurements 
from multiple cell lines, the probabilistic connectiv- 
ity mapping procedure is completed by computing the 



relevance measures between pairs of drugs. We define the 
relevance between drugs i and j as the Pearson correlation 
between the latent variables z; and zj. The correlation- 
based relevance measure has the favorable property of 
focusing on the active (non-zero) factor values, represent- 
ing relevant activity for the query. The measure is addi- 
tionally normalized by definition, removing the effects of 
varying norms of the samples. Depending on the task of 
the analyst, the relevance can be computed over all or a 
subset of the factors, for example only the factors shared 
by two or more views. In this paper, we use data from 
all three cell lines in the CMap data, preprocessed as in 
[14], to allow fair comparison with the alternative meth- 
ods. However, the model could be learned from only a 
subset of the cell lines as well. 

Combinatorial retrieval 

There are many situations where single-drug connectiv- 
ity mapping does not provide fully satisfactory results. 
For example, many drugs activate multiple targets and 
biological processes, which is called polypharmacology 
[12,13]. If we assume that a query drug q activates two 
distinct biological processes, single-drug retrieval would 
tend to provide relevant matches to only the most dom- 
inant one of them, whereas an optimal retrieval result 
would cover them both. This can be achieved with com- 
binatorial retrieval, where pairs (or more) of drugs are 
searched for instead of single drugs, such that each drug 
in the pair matches to one of the active processes of the 
query. This can be formulated as an extension of the prob- 
abilistic connectivity mapping to combinatorial retrieval. 
The goal is then to search for the pair p of drugs i and 
j that jointly explain the query activity better than any 
single drug. This is achieved by combining the factor pro- 
files of the pair of drugs into a single factor profile z p 
such that it maximizes the relevance, i.e. cor(z p ,z q ). For- 
mally, z p = {z P)k },k e {1, . . .,K},z Pj k e {z i} k,z j}k }. In other 
words, each factor value z P) k is chosen from either z; or z ; , 
and the choices are made to maximize cor(z p , z q ). 

Validation 

To validate the probabilistic connectivity mapping 
approach, we use two external ground truth data sets of 
known drug similarity as in [14]: Shared ATC codes and 
chemical similarity. According to the first set, drugs are 
considered functionally similar if they share the level four 
Anatomic Therapeutic Chemical (ATC) classification 
codes [23]. The ATC is a hierarchical grouping of drugs 
based on the organ or systems on which they act, and 
their therapeutic, pharmacological, and chemical proper- 
ties. The alternative is to consider two drugs (chemically) 
similar if the Tanimoto similarity between their 2D fin- 
gerprints is higher than 0.8. Tanimoto similarities are 
computed using the redk R package [24]. 
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Two different goodness measures are computed for 
the retrieval, given a ranked list of other drugs for the 
query drug, and external ground truth stemming from 
either Tanimoto or ATC The first is partial area under 
the ROC curve (FPR < 0.1) over the pooled set of all 
drug pair similarities, as in [14]. The second is top- 10 
mean average precision (MAP), a standard goodness mea- 
sure in information retrieval. The two goodness measures 
focus on different, complementary aspects of retrieval 
performance: Partial AUC focuses on the overall short- 
est distances, which the user might want to explore, 
emphasizing the cases where relevant matches for the 
drugs are easily found. The top- 10 MAP, in contrast, 
is a mean over all query drugs, giving equal weight 
also to those drugs for which a match is harder to 
find. 

To validate the combinatorial retrieval approach, we 
constructed a setup for testing the ability of the model 
to retrieve relevant drugs for a given polypharmacologic 
query drug. In particular, we used the subset of drugs with 
multiple ATC code assignments as queries. The results 
were ranked based on both single-drug retrieval and com- 
binatorial retrieval, and the top rank positions in which 
each ATC code shared with the query first appeared in 
the lists were found. We then computed the proportion 
of query drugs with at least one ATC code for which the 
combinatorial retrieval gives an improved ranking com- 
pared to single retrieval, using ATC code levels from one 
to four. The rationale is that if one ATC label dominates 
the effects, it is likely to appear high in the standard (sin- 
gle drug) retrieval, whereas other minor effects related 
to other ATC(s) may be further down in the results 
list. Combinatorial retrieval, however, also allows minor 
results to appear in the top ranks. By jointly evaluat- 
ing all the ATC codes for the query drug, we can see 
whether combinatorial retrieval finds drugs that match 
the ATC codes but do not show up high on standard 
retrieval. 

As there are some drugs that share the same multiple 
ATC codes, those are likely to be found by single-drug 
retrieval more easily. We thus additionally evaluated the 
setup where such drugs are removed from the set of 
drugs retrieved; this should highlight how many addi- 
tional drugs the combinatorial retrieval can find. 

Availability and requirements 

Project name: Probabilistic connectivity mapping 
Project home page: http://research.ics.aalto.fi/mi/ 
software/ProbCMap/ 

Operating systems: Platform independent 
Programming language: R 
Other requirements: None 
License: FreeBSD 

Any restrictions to use by non-academics: No 



Appendix 

The full posterior distribution of the GFA model is 

p(0\X) =p(X\Z,W,a,T)p(Z)p(W\a)p(a\a a ,b 0( ) 
p(r\a T ,b T )/p(X). 



(7) 



For the variational inference, the posterior is approxi- 
mated as 

p(0\X) * q(0) = q(Z)q(W)q(a)q(r) . (8) 
The latent factors are updated as 

N N 

q(Z) = Y\ q(zd = Y\M(Ti\raf \ Z<*>) , (9) 

i=l i=l 

where the parameters are: 

5^) = (l k + ^<r m )(W (m) W (m)T A 

\ ra=l / 
M 

m=l 

The projection matrices are updated as 

M D m 

q(W ) = n n^ w SV3' ^ > do) 

m=l j=l 

where wfj*' denotes the y'th column of matrix W (m \ 



nv 



and a m is the mth row of a transferred into a diagonal 
K x K matrix. 

The noise precision q(r) = Y\m=l G(r m \a T m , b x m ) parame- 
ters are updated as 

r r , D m N 

a m = a + ~Y~ 

N 



^ = ^ T + ^EH ra) - w(m)T ^ 2 >- 

i—1 



The ARD precision q(a) = Um=i YlLi Q^ mk \a a m ,b a mM ) 
parameters are updated as 

a a , Dm 
a m= a + ~Y 

( (m) (m)T } 
m,k 1 2 
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